Automatic Data Layout Optimizations for GPUs

نویسندگان

Klaus Kofler

Biagio Cosenza

Thomas Fahringer

چکیده

Memory optimizations have became increasingly important in order to fully exploit the computational power of modern GPUs. The data arrangement has a big impact on the performance, and it is very hard for GPU programmers to identify a well-suited data layout. Classical data layout transformations include grouping together data fields that have similar access patterns, or transforming Array-of-Structures (AoS) to Structure-of-Arrays (SoA). This paper presents an optimization infrastructure to automatically determine an improved data layout for OpenCL programs written in AoS layout. Our framework consists of two separate algorithms: The first one constructs a graph-based model, which is used to split the AoS input struct into several clusters of fields, based on hardware dependent parameters. The second algorithm selects a good per-cluster data layout (e.g., SoA, AoS or an intermediate layout) using a decision tree. Results show that the combination of both algorithms is able to deliver higher performance than the individual algorithms. The layouts proposed by our framework result in speedups of up to 2.22, 1.89 and 2.83 on an AMD FirePro S9000, NVIDIA GeForce GTX 480 and NVIDIA Tesla k20m, respectively, over different AoS sample programs, and up to 1.18 over a manually optimized program.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Parallelizing Branch-and-Bound on GPUs for Optimization of Multiproduct Batch Plants

Parallel implementation of the Branch-and-Bound (B&B) technique for optimization problems is a promising approach to accelerating their solution, but it remains challenging on Graphics Processing Units (GPUs) due to B&B’s irregular data structures and poor computation/communication ratio. The contributions of this paper are as follows: 1) we develop two basic implementations (iterative and recu...

متن کامل

A Compiler Framework for Optimization of Affine Loop Nests for General Purpose Computations on GPUs

GPUs are a class of specialized parallel architectures with tremendous computational power. The new Compute Unified Device Architecture (CUDA) programming model from NVIDIA facilitates programming of general purpose applications on NVIDIA GPUs. However, there are various performance-influencing factors specific to GPU architectures that need to be accurately characterized to effectively utilize...

متن کامل

Parallelizing Alternating Direction Implicit Solver on GPUs

We present a parallel Alternating Direction Implicit (ADI) solver on GPUs. Our implementation significantly improves existing implementations in two aspects. First, we address the scalability issue of existing Parallel Cyclic Reduction (PCR) implementations by eliminating their hardware resource constraints. As a result, our parallel ADI, which is based on PCR, no longer has the maximum domain ...

متن کامل

Data Layout Transformation for Structured-Grid Codes on GPU

We present data layout transformation as an effective performance optimization for memory-bound structuredgrid applications for GPUs. Structured grid applications are a class of applications that compute grid cell values on a regular 2D, 3D or higher dimensional regular grid. Each output point is computed as a function of itself and its nearest neighbors. Stencil code is an instance of this app...

متن کامل

Automatic Translation of Cuda to Opencl and Comparison of Performance Optimizations on Gpus

As an open, royalty-free framework for writing programs that execute across heterogeneous platforms, OpenCL gives programmers access to a variety of data parallel processors including CPUs, GPUs, the Cell and DSPs. All OpenCL-compliant implementations support a core specification, thus ensuring robust functional portabiity of any OpenCL program. This thesis presents the CUDAtoOpenCL source-to-s...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2015

Automatic Data Layout Optimizations for GPUs

نویسندگان

چکیده

منابع مشابه

Parallelizing Branch-and-Bound on GPUs for Optimization of Multiproduct Batch Plants

A Compiler Framework for Optimization of Affine Loop Nests for General Purpose Computations on GPUs

Parallelizing Alternating Direction Implicit Solver on GPUs

Data Layout Transformation for Structured-Grid Codes on GPU

Automatic Translation of Cuda to Opencl and Comparison of Performance Optimizations on Gpus

عنوان ژورنال:

اشتراک گذاری